(54) Instruction cache associative cross-bar switch



(57) A method for operating a processor, comprising the steps
of storing in a memory a plurality of instructions, each
instruction being one of a plurality of instruction types, the
instructions encoded in frames, each frame including a plurality
of instruction slots and template bits that specify instruction
group boundaries within the frame, with an instruction group
comprising a set of statically contiguous instructions that are
executed concurrently; using a crossbar switch means coupled
to a plurality of execution units to issue instructions in
the instruction group in parallel to execution units from
the plurality of execution units in response to the template
bits; wherein each execution unit of the plurality of
execution units is one of a plurality of execution unit
types; and wherein each instruction type is executed on
one or more execution unit types.



[FIG. 2: block diagram showing the secondary cache, predecoder, TLB, address multiplexer, two-way set-associative primary instruction cache with tags, associative crossbar, and pipes 0 to 7]



Printed by Jouve. 75001 PARIS (FR) 






EP 1 186 995 A1 






Description 

[0001] This invention relates to a method for operating
a processor according to claim 1, a processor according
to claim 11 and to a cache memory according to claim 21,
and thus to an architecture in which individual instructions
may be executed in parallel, as well as to
methods and apparatus for accomplishing that.
[0002] A common goal in the design of computer architectures
is to increase the speed of execution of a
given set of instructions. One approach to increasing
instruction execution rates is to issue more than one
instruction per clock cycle, in other words, to issue
instructions in parallel. This allows the instruction execution
rate to exceed the clock rate. Computing systems that
issue multiple independent instructions during each
clock cycle must solve the problem of routing the
individual instructions that are dispatched in parallel to their
respective execution units. One mechanism used to
achieve this parallel routing of instructions is generally
called a "crossbar switch."

[0003] In present state of the art computers, e.g. the 
Digital Equipment Alpha, the Sun Microsystems Super- 
Sparc, and the Intel Pentium, the crossbar switch is im- 
plemented as part of the instruction pipeline. In these 
machines the crossbar is placed between the instruction 
decode and instruction execute stages. This is because 
the conventional approach requires the instructions to 
be decoded before it is possible to determine the pipe- 
line to which they should be dispatched. Unfortunately, 
decoding in this manner slows system speed and re- 
quires extra surface area on the integrated circuit upon 
which the processor is formed. These disadvantages 
are explained further below. 
[0004] The invention is defined in claims 1, 11 and 21.
[0005] A computing system architecture is provided 
that enables instructions to be routed to an appropriate 
pipeline more quickly, at lower power, and with simpler 
circuitry than previously possible. This invention places 
the crossbar switch earlier in the pipeline, making it a 
part of the initial instruction fetch operation. This allows 
the crossbar to be a part of the cache itself, rather than 
a stage in the instruction pipeline. It also allows the 
crossbar to take advantage of circuit design parameters 
that are typical of regular memory structures rather than 
random logic. Such advantages include: lower switching 
voltages (200 - 300 millivolts rather than 3 - 5 volts); 
more compact design, and higher switching speeds. In 
addition, if the crossbar is placed in the cache, the need 
for many sense amplifiers is eliminated, reducing the cir- 
cuitry required in the system as a whole. 
[0006] To implement the crossbar switch, the instructions
coming from the cache, or otherwise arriving at the
switch, must be tagged or otherwise associated with a
pipeline identifier to direct the instructions to the
appropriate pipeline for execution. In other words, pipeline
dispatch information must be available at the crossbar
switch at instruction fetch time, before conventional
instruction decode has occurred. There are several ways
this capability can be satisfied. In one embodiment this
system includes a mechanism that routes each instruction
in a set of instructions to be executed in parallel to
an appropriate pipeline, as determined by a pipeline tag
applied to each instruction during compilation, or placed
in a separate identifying instruction that accompanies
the original instruction. Alternately the pipeline affiliation
can be determined after compilation at the time that
instructions are fetched from memory into the cache,
using a special predecoder unit.
[0007] Thus, in one implementation, this system in- 
cludes a register or other means, for example, the mem- 
ory cells providing for storage of a line in the cache, for 
holding instructions to be executed in parallel. Each in- 
struction has associated with it a pipeline identifier in- 
dicative of the pipeline to which that instruction is to be 
issued. A crossbar switch is provided which has a first
set of connectors coupled to receive the instructions, 
and a second set of connectors coupled to the process- 
ing pipelines to which the instructions are to be dis- 
patched for execution. Means are provided which are 
responsive to the pipeline identifiers of the individual in- 
structions in the group supplied to the first set of con- 
nectors for routing those individual instructions onto ap- 
propriate paths of the second set of connectors, thereby 
supplying each instruction in the group to be executed 
in parallel to the appropriate pipeline. 
[0008] In a preferred embodiment of this invention the 
associative crossbar is implemented in the instruction 
cache. By placing the crossbar in the cache all switching 
is done at low signal levels (approximately 200 - 300 
millivolts). Switching at these low levels is substantially 
faster than switching at higher levels (5 volts) after the 
sense amplifiers. The lower power also eliminates the 
need for large driver circuits, and eliminates numerous 
sense amplifiers. Additionally, by implementing the 
crossbar in the cache, the layout pitch of the crossbar 
lines matches the pitch of the layout of the cache. 
[0009] A further embodiment of the invention con- 
cerns an apparatus in a computing system in which 
groups of individual instructions are executable in par- 
allel by processing pipelines, said apparatus being used 
for routing each instruction in a group to be executed in 
parallel to an appropriate pipeline comprises 

a storage for holding at least one group of instruc- 
tions to be executed in parallel, each instruction in 
the group having associated therewith a pipeline 
identifier indicative of the pipeline for executing that 
instruction; 

a crossbar having a first set of connectors coupled 
to the storage for receiving instructions therefrom 
and a second set of connectors coupled to the 
processing pipelines; 

means responsive to the pipeline identifier of the
individual instructions in the group for routing individual
instructions onto appropriate ones of the second
set of connectors, to thereby supply each instruction
in the group to be executed in parallel to the
appropriate pipeline.

[0010] In this apparatus, the first set of connectors 
may consist of a set of first communication buses, one 
for each instruction in the storage; 

the second set of connectors may consist of a set 
of second communication buses, one for each pipe- 
line; and 

the means responsive to the pipeline identifier may 
comprise a set of decoders coupled to the storage 
to receive as first input signals the pipeline identifi- 
ers and in response thereto supply as output signals 
a switch control signal; and a set of switches, cou- 
pled to the decoders, one switch at the intersection 
of each of the first set of connectors with the second 
set of connectors, the switches providing connec- 
tions in response to receiving the switch control sig- 
nal to thereby supply each instruction in the group 
to be executed in parallel to the appropriate pipe- 
line. 

[0011] A further embodiment of the invention con- 
cerns an apparatus in a computing system in which sets 
of individual instructions are executable in parallel by 
processing pipelines, said apparatus being used for 
routing each instruction in a group to be executed in par- 
allel to an appropriate pipeline comprises 

a storage for holding a collection of instructions, in- 
cluding at least one set of instructions to be execut- 
ed in parallel, each instruction in the set having as- 
sociated therewith a pipeline identifier indicative of 
the pipeline to which that instruction is to be issued; 
a crossbar switch having a first set of connectors 
coupled to the storage for receiving instructions 
therefrom and a second set of connectors coupled 
to the processing pipelines; 
selection means connected to receive the set of in- 
structions and connected to receive information 
about those instructions to be next executed in par- 
allel for supplying in response thereto an output sig- 
nal indicative of the next set of instructions to be 
executed in parallel; and 

decoder means coupled to receive the output signal 
and each of the pipeline identifiers of the instruc- 
tions in the storage for selectively connecting ones 
of the first set of connectors to ones of the second 
set of connectors to thereby supply each instruction 
in the set to be executed in parallel to the appropri- 
ate pipeline. 

[0012] In this apparatus, the first set of connectors
may consist of a set of first communication buses, one
for each instruction in the storage;

the second set of connectors may consist of a set
of second communication buses, one for each pipeline;

the decoder means may comprise a set of decoders
coupled to receive as first input signals the pipeline
identifiers and the information about the next group
of instructions to be executed by the pipelines and
in response thereto supply as output signals a
switch control signal; and

the crossbar switch may include a set of switches,
one at the intersection of each of the first set of
connectors with the second set of connectors, the
switches providing connections in response to
receiving the switch control signal to thereby supply
each instruction in the group to be executed in
parallel to the appropriate pipeline.

[0013] Further, the multiplexer may supply an output
signal to the decoders to select the next group of
instructions to be supplied to the pipelines.

[0014] A further embodiment of the invention concerns
a method in a computing system in which groups
of individual instructions are executable in parallel by
processing pipelines, said method being used for
transferring each instruction in a group to be executed
through a crossbar switch having a first set of connectors
coupled to the storage for receiving instructions
therefrom and a second set of connectors coupled to
the processing pipelines, said method comprising:

storing in storage at least one group of instructions
to be executed in parallel, each instruction in the
group having associated therewith a pipeline identifier
indicative of the pipeline which will execute that
instruction; and

using the pipeline identifiers of the individual
instructions in the at least one group of instructions
which are to be executed next to control switches
between the first set of connectors and the second
set of connectors to thereby supply each instruction
in the group to be executed in parallel to the
appropriate pipeline.

[0015] The step of using may comprise

supplying the pipeline identifiers of the individual
instructions in the at least one group of instructions
to a corresponding number of decoders, each of
which provides an output signal indicative of the
pipeline identifiers; and

using the decoder output signals to control the
switches between the first set of connectors and the
second set of connectors to thereby supply each
instruction in the group to be executed in parallel to
the appropriate pipeline.

[0016] According to one embodiment, there is provided
in a computing system in which groups of individual
instructions are executable in parallel by processing
pipelines, a method for supplying each instruction in a
group to be executed in parallel to an appropriate
pipeline, the method comprising:

storing in storage at least one group of instructions 
to be executed in parallel, each instruction in the 
group having associated therewith a pipeline iden- 
tifier indicative of the pipeline which will execute that 
instruction; and 

using the pipeline identifier of those instructions to 
be next executed in parallel to control switches in a 
crossbar switch having a first set of connectors cou- 
pled to the storage for receiving instructions there- 
from and a second set of connectors coupled to the 
processing pipelines to thereby supply each in- 
struction in the group to be executed in parallel to 
the appropriate pipeline. 

Figure 1 is a block diagram illustrating a typical en- 
vironment for a preferred implementation of this in- 
vention; 

Figure 2 is a diagram illustrating the overall structure
of the instruction cache of Figure 1;
Figure 3 is a diagram illustrating one embodiment 
of the associative crossbar; 
Figure 4 is a diagram illustrating another embodi- 
ment of the associative crossbar; and 
Figure 5 is a diagram illustrating another embodi- 
ment of the associative crossbar. 

[0017] Figure 1 illustrates the organization of the integrated
circuit chips by which the computing system is
formed. As depicted, the system includes a first integrat- 
ed circuit 10 that includes a central processing unit, a 
floating point unit, and an instruction cache. 
[0018] In the preferred embodiment the instruction
cache is a 16 kilobyte two-way set-associative 32 byte 
line cache. A set associative cache is one in which the 
lines (or blocks) can be placed only in a restricted set of 
locations. The line is first mapped into a set, but can be 
placed anywhere within that set. In a two-way set asso- 
ciative cache, two sets, or compartments, are provided, 
and each line can be placed in one compartment or the 
other. 
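
The cache dimensions quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function name `cache_geometry` is hypothetical and does not appear in this description.

```python
# Geometry of the instruction cache described above (sizes taken from the text).
CACHE_BYTES = 16 * 1024   # 16 kilobyte cache
LINE_BYTES = 32           # 32 byte lines
WAYS = 2                  # two-way set-associative (two "compartments")


def cache_geometry(cache_bytes, line_bytes, ways):
    """Return (lines per compartment, index bits, byte-offset bits).

    Assumes all three sizes are powers of two.
    """
    total_lines = cache_bytes // line_bytes
    lines_per_way = total_lines // ways
    index_bits = lines_per_way.bit_length() - 1   # log2 for powers of two
    offset_bits = line_bytes.bit_length() - 1
    return lines_per_way, index_bits, offset_bits


LINES_PER_WAY, INDEX_BITS, OFFSET_BITS = cache_geometry(CACHE_BYTES, LINE_BYTES, WAYS)
```

With these figures each compartment holds 256 lines, selected by 8 index bits, with 5 bits addressing a byte within a line.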

[0019] The system also includes a data cache chip 20 
that comprises a 32 kilobyte four-way set-associative 32 
byte line cache. The third chip 30 of the system includes 
a predecoder, a cache controller, and a memory control- 
ler. The predecoder and instruction cache are explained 
further below. For the purposes of this invention, the 
CPU, FPU, data cache, cache controller and memory 
controller all may be considered of conventional design. 
[0020] The communication paths among the chips are
illustrated by arrows in Figure 1. As shown, the CPU/FPU
and instruction cache chip communicates over a
32 bit wide bus 12 with the predecoder chip 30. The
asterisk is used to indicate that these communications are
multiplexed so that a 64 bit word is communicated in two
cycles. Chip 10 also receives information over 64 bit
wide buses 14, 16 from the data cache 20, and supplies
information to the data cache 20 over three 32 bit wide
buses 18. The predecoder decodes a 32 bit instruction
received from the secondary cache into a 64 bit word,
and supplies that 64 bit word to the instruction cache on
chip 10.

[0021] The cache controller on chip 30 is activated 
whenever a first level cache miss occurs. Then the
cache controller either goes to main memory or to the 
secondary cache to fetch the needed information. In the 
preferred embodiment the secondary cache lines are 32 
bytes and the cache has an 8 kilobyte page size. 
[0022] The data cache chip 20 communicates with the
cache controller chip 30 over another 32 bit wide bus. 
In addition, the cache controller chip 30 communicates 
over a 64 bit wide bus 32 with the DRAM memory, over 
a 128 bit wide bus 34 with a secondary cache, and over 
a 64 bit wide bus 36 to input/output devices. 
[0023] As will be described further below, the system 
shown in Figure 1 includes multiple pipelines able to op- 
erate in parallel on separate instructions which are dis- 
patched to these parallel pipelines simultaneously. In 
one embodiment the parallel instructions have been 
identified by the compiler and tagged with a pipeline 
identification tag indicative of the specific pipeline to 
which that instruction should be dispatched. 
[0024] In this system, an arbitrary number of instruc- 
tions can be executed in parallel. In one embodiment of 
this system the central processing unit includes eight 
functional units and is capable of executing eight in- 
structions in parallel. These pipelines are designated 
using the digits 0 to 7. Also, for this explanation each 
instruction word is assumed to be 32 bits (4 bytes) long. 
[0025] As briefly mentioned above, in the preferred 
embodiment the pipeline identifiers are associated with 
individual instructions in a set of instructions during com- 
pilation. In the preferred embodiment, this is achieved 
by compiling the instructions to be executed using a 
well-known compiler technology. During the compila- 
tion, the instructions are checked for data dependen- 
cies, dependence upon previous branch instructions, or 
other conditions that preclude their execution in parallel 
with other instructions. The result of the compilation is 
identification of a set or group of instructions which can 
be executed in parallel. In addition, in the preferred em- 
bodiment, the compiler determines the appropriate 
pipeline for execution of an individual instruction. This 
determination is essentially a determination of the type 
of instruction provided. For example, load instructions
will be sent to the load pipeline, store instructions to the
store pipeline, etc. The association of the instruction 
with the given pipeline can be achieved either by the 
compiler, or by later examination of the instruction itself, 
for example, during predecoding. 
[0026] Referring again to Figure 1, in normal operation
the CPU will execute instructions from the instruction
cache according to well-known principles. On an
instruction cache miss, however, a set of instructions
containing the instruction missed is transferred from the
main memory into the secondary cache and then into
the primary instruction cache, or from the secondary
cache to the primary instruction cache, where it occupies
one line of the instruction cache memory. Because
instructions are only executed out of the instruction
cache, all instructions ultimately undergo the following
procedure.

[0027] At the time a group of instructions is transferred 
into the instruction cache, the instruction words are pre- 
decoded by the predecoder 30. As part of the predecod- 
ing process, a multiple bit field prefix is added to each 
instruction based upon a tag added to the instruction by 
the compiler. This prefix gives the explicit pipe number 
of the pipeline to which that instruction will be routed. 
Thus, at the time an instruction is supplied from the pre- 
decoder to the instruction cache, each instruction will 
have a pipeline identifier. 
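
As a rough illustration of the predecoding step, the sketch below attaches a pipeline number to a 32 bit instruction word. The description does not give the bit layout of the 64 bit predecoded word, so placing the prefix directly above bit 31 is an assumption, as are the names `predecode` and `split_predecoded`; the 4 bit field width follows the "4 bit pipeline code" mentioned later in this description.

```python
PIPE_FIELD_BITS = 4   # width of the pipeline code stated in the description
INSTR_BITS = 32       # instruction words are 32 bits long


def predecode(instr, pipe):
    """Attach a pipeline identifier prefix to a 32 bit instruction word.

    Assumption: the prefix sits immediately above the instruction bits.
    """
    if not 0 <= instr < (1 << INSTR_BITS):
        raise ValueError("instruction must fit in 32 bits")
    if not 0 <= pipe < (1 << PIPE_FIELD_BITS):
        raise ValueError("pipeline number must fit in the prefix field")
    return (pipe << INSTR_BITS) | instr


def split_predecoded(word):
    """Recover (pipeline number, instruction) from a predecoded word."""
    pipe = (word >> INSTR_BITS) & ((1 << PIPE_FIELD_BITS) - 1)
    instr = word & ((1 << INSTR_BITS) - 1)
    return pipe, instr
```

The resulting word fits comfortably in the 64 bits that the predecoder supplies per instruction; how the remaining bits are used is not stated here.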

[0028] It may be desirable to implement the system of 
this invention on computer systems that already are in 
existence and therefore have instruction structures that 
have already been defined without available blank fields 
for the pipeline information. In this case, in another em- 
bodiment of this invention, the pipeline identifier infor- 
mation is supplied on a different clock cycle, then com- 
bined with the instructions in the cache or placed in a 
separate smaller cache. Such an approach can be 
achieved by adding a "no-op" instruction with fields that 
identify the pipeline for execution of the instruction, or 
by supplying the information relating to the parallel
instructions in another manner. It therefore should be
appreciated that the manner in which the instruction and
pipeline identifier arrives at the crossbar to be proc- 
essed is somewhat arbitrary. I use the word "associated" 
herein to designate the concept that the pipeline identi- 
fiers are not required to have a fixed relationship to the 
instruction words. That is, the pipeline identifiers need 
not be embedded within the instructions themselves by 
the compiler. Instead they may arrive from another 
means, or on a different cycle. 
[0029] Figure 2 is a simplified diagram illustrating the 
secondary cache, the predecoder, and the instruction 
cache. This figure, as well as Figures 3, 4 and 5, are 
used to explain the manner in which the instructions 
tagged with the pipeline identifier are routed to their des- 
ignated instruction pipelines. 
[0030] In Figure 2, for illustration, assume that groups 
of instructions to be executed in parallel are fetched in 
a single transfer across a 256 bit (32 byte) wide path 
from a secondary cache 50 into the predecoder 60. As 
explained above, the predecoder prefixes the pipeline 
"P" field to the instruction. After predecoding the result- 
ing set of instructions is transferred into the primary in- 
struction cache 70. At the same time, a tag is placed into 
the tag field 74 for that line. 
[0031] In the preferred embodiment the instruction
cache operates as a conventional physically-addressed
instruction cache. In the example depicted in Figure 2,
the instruction cache will contain 512 bit sets of
instructions of eight instructions each, organized in two
compartments of 256 lines.

[0032] Address sources for the instruction cache arrive
at a multiplexer 80 that selects the next address to
be fetched. Because preferably instructions are always
machine words, the low order two address bits <1:0> of
the 32 bit address field supplied to multiplexer 80 are
discarded. These two bits designate byte and half-word
boundaries. Of the remaining 30 bits, the next three low
order address bits <4:2>, which designate a particular
instruction word in the set, are sent directly via bus 81
to the associative crossbar. The next low eight address
bits <12:5> are supplied over bus 82 to the instruction
cache 70 where they are used to select one of the 256
lines in the instruction cache. Finally, the remaining 19
bits of the virtual address <31:13> are sent to the
translation lookaside buffer (TLB) 90. The TLB translates
these bits into the high 19 bits of the physical address.
The TLB then supplies them over bus 84 to the instruction
cache. In the cache they are compared with the tag
of the selected line, to determine if there is a "hit" or a
"miss" in the instruction cache.
[0033] If there is a hit in the instruction cache, indicating
that the addressed instruction is present in the
cache, then the selected set of instructions is transferred
across the 512 bit wide bus 73 into the associative
crossbar 100. The associative crossbar 100 then dispatches
the addressed instructions to the appropriate
pipelines over buses 110, 111, ..., 117. Preferably the bit
lines from the memory cells storing the bits of the
instruction are themselves coupled to the associative
crossbar. This eliminates the need for numerous sense
amplifiers, and allows the crossbar to operate on the
lower voltage swing information from the cache line
directly, without the normally intervening driver circuitry to
slow system operation.
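
The address fields and the tag comparison described in the two preceding paragraphs can be sketched as follows. This is an illustration only: the function names are hypothetical, the TLB translation itself is not modelled, and the tag store is assumed to be a plain list per compartment with no valid bits.

```python
def split_fetch_address(addr):
    """Split a 32 bit fetch address into the fields described in the text.

    Bits <1:0> are discarded because instructions are whole machine words.
    """
    if not 0 <= addr < (1 << 32):
        raise ValueError("address must fit in 32 bits")
    word_select = (addr >> 2) & 0x7         # bits <4:2>: one of eight words, to the crossbar
    line_index = (addr >> 5) & 0xFF         # bits <12:5>: one of 256 cache lines
    virtual_page = (addr >> 13) & 0x7FFFF   # bits <31:13>: 19 bits sent to the TLB
    return word_select, line_index, virtual_page


def is_hit(tags_per_compartment, line_index, physical_page):
    """Compare the translated high bits with the tag of the selected line
    in each compartment (assumed layout: one tag list per compartment)."""
    return any(tags[line_index] == physical_page for tags in tags_per_compartment)
```

For example, an address whose bits <4:2> equal 5 selects instruction word 5 within the 512 bit set read from the indexed line.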

[0034] Figure 3 illustrates in more detail one embodiment
of the associative crossbar. A 512 bit wide register
130, which represents the memory cells in a line of the
cache (or can be a physically separate register), contains
at least the set of instructions capable of being issued.
For the purposes of illustration, register 130 is
shown as containing up to eight instruction words W0 to
W7. Using means described in the copending application
referred to above, the instructions have been sorted
into groups for parallel execution. For illustration here,
assume the instructions in Group 1 are to be dispatched
to pipelines 1, 2 and 3; the instructions in Group 2 to
pipelines 1, 3 and 6; and the instructions in Group 3 to
pipelines 1 and 6. The decoder select signal enables
only the appropriate set of instructions to be executed
in parallel, essentially allowing register 130 to contain
more than just one set of instructions. Of course, by
using register 130 for only one set of parallel instructions
at a time, the decoder select signal is not needed.
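
The effect of the decoder select signal on the example groups can be sketched as below. The placement of Group 2 in words W3 to W5 follows the worked example given for Figure 3; the ordering of words inside Groups 1 and 3, and the names `REGISTER_130` and `enabled_words`, are assumptions made only for this illustration.

```python
# (group, pipeline) for words W0..W7 held in register 130.
# Group 2 matches the W3/W4/W5 example; the order inside groups 1 and 3 is assumed.
REGISTER_130 = [
    (1, 1), (1, 2), (1, 3),   # W0, W1, W2
    (2, 1), (2, 3), (2, 6),   # W3, W4, W5
    (3, 1), (3, 6),           # W6, W7
]


def enabled_words(register, selected_group):
    """Return {word index: pipeline} for the words enabled by the decoder select."""
    return {
        index: pipe
        for index, (group, pipe) in enumerate(register)
        if group == selected_group
    }
```

Selecting Group 2 enables only W3, W4 and W5, so the other five words stay in the register without driving any pipeline.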
[0035] As shown in Figure 3, the crossbar switch itself
consists of two sets of crossing pathways. In the horizontal
direction are the pipeline pathways 180, 181, ...,
187. In the vertical direction are the instruction word
paths 190, 191, ..., 197. Each of these pipeline and
instruction pathways is itself a bus for transferring
the instruction word. Each horizontal pipeline pathway
is coupled to a pipeline execution unit 200, 201, 202, ...,
207. Each of the vertical instruction word pathways 190,
191, ..., 197 is coupled to an appropriate portion of
register or cache line 130.

[0036] The decoders 170, 171, ..., 177 associated
with each instruction word pathway receive the 4 bit
pipeline code from the instruction. Each decoder, for
example decoder 170, provides eight 1 bit control lines as
output. One of these control lines is associated with
each pipeline pathway crossing of that instruction word
pathway. Selection of a decoder as described with
reference to Figure 3 activates the output bit control line
corresponding to that input pipe number. This signals
the crossbar to close the switch between the word path
associated with that decoder and the pipe path selected
by that bit line. Establishing the cross connection between
these two pathways causes a selected instruction
word to flow into the selected pipeline. For example,
decoder 173 has received the pipeline bits for word W3.
Word W3 has associated with it pipeline path 1. The
pipeline path 1 bits are decoded to activate switch 213
to supply instruction word W3 to pipeline execution unit
201 over pipeline path 181. In a similar manner, the
identification of pipeline path 3 for decoder D4 activates
switch 234 to supply instruction word W4 to pipeline path
3. Finally, the identification of pipeline 6 for word W5 in
decoder D5 activates switch 265 to transfer instruction
word W5 to pipeline execution unit 206 over pipeline
pathway 186. Thus, instructions W3, W4 and W5 are
executed by pipes 201, 203 and 206, respectively. The
pipeline processing units 200, 201, ..., 207 shown in
Figure 3 can carry out desired operations. In a preferred
embodiment of the invention, each of the eight pipelines
first includes a sense amplifier to detect the state of the
signals on the bit lines from the crossbar. In one
embodiment the pipelines include first and second arithmetic
logic units; first and second floating point units; first and
second load units; a store unit and a control unit. The
particular pipeline to which a given instruction word is
dispatched will depend upon hardware constraints as
well as data dependencies.
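
The decoder and switch behaviour described for Figure 3 can be modelled in software as follows. This is a behavioural sketch only, not a description of the circuit: the names `decode_one_hot` and `route` are hypothetical, and raising an error when two enabled words name the same pipeline is a choice made for the illustration.

```python
NUM_PIPES = 8


def decode_one_hot(pipe, enabled=True):
    """Model one decoder: eight 1 bit control lines, at most one of them active."""
    if not 0 <= pipe < NUM_PIPES:
        raise ValueError("pipeline number out of range")
    return [enabled and p == pipe for p in range(NUM_PIPES)]


def route(words, pipes, enabled):
    """Close the switch at each (word path, pipe path) crossing whose
    control line is active, and return what each pipeline receives."""
    outputs = [None] * NUM_PIPES
    for word, pipe, is_enabled in zip(words, pipes, enabled):
        control_lines = decode_one_hot(pipe, is_enabled)
        for p, active in enumerate(control_lines):
            if active:
                if outputs[p] is not None:
                    raise ValueError("two enabled words target the same pipeline")
                outputs[p] = word
    return outputs
```

With W3, W4 and W5 enabled and tagged for pipelines 1, 3 and 6, `route` places W3 on pipe path 1, W4 on pipe path 3 and W5 on pipe path 6, matching the example in the text.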

[0037] Figure 4 is a diagram illustrating another em- 
bodiment of the associative crossbar. In Figure 4 nine 
pipelines 0 - 8 are shown coupled to the crossbar. The 
decode select is used to enable a subset of the instruc- 
tions in the register 130 for execution just as in the sys- 
tem of Figure 3. 

[0038] The execution ports that connect to the pipelines
specified by the pipeline identification bits of the
enabled instructions are then selected to multiplex out
the appropriate instructions from the contents of the
register. If one or more of the pipelines is not ready to
receive a new instruction, a set of hold latches at the
output of the execution ports prevents any of the enabled
instructions from issuing until the "busy" pipeline is free.
Otherwise the instructions pass transparently through
the hold latches into their respective pipelines.
Accompanying the output of each port is a "port valid" signal
that indicates whether the port has valid information to
issue to the hold latch.
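
The hold latch behaviour can be sketched as an all-or-nothing issue decision. The text does not say whether a busy pipeline with no valid instruction waiting for it also blocks issue; the sketch below assumes it does not. The function name `issue` and the use of `None` for a deasserted "port valid" signal are illustrative choices.

```python
def issue(port_outputs, pipe_busy):
    """Decide what enters each pipeline this cycle.

    port_outputs[p] is the instruction at execution port p, or None when
    the port has nothing valid to issue. pipe_busy[p] is True when
    pipeline p cannot accept a new instruction.
    """
    valid = [instr is not None for instr in port_outputs]
    # Assumption: only a busy pipeline that an enabled instruction targets stalls issue.
    stalled = any(v and busy for v, busy in zip(valid, pipe_busy))
    if stalled:
        # The hold latches keep every enabled instruction back.
        return [None] * len(port_outputs)
    # Otherwise the latches are transparent.
    return list(port_outputs)
```

If pipeline 3 is busy while W4 waits at port 3, none of W3, W4 or W5 issues in that cycle; once pipeline 3 is free all three pass through together.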
[0039] Figure 5 illustrates an alternate embodiment
for the invention where pipeline tags are not included 
with the instruction, but are supplied separately, or 
where the cache line itself is used as the register for the 
crossbar. In these situations, the pipeline tags may be 
placed into a high speed separate cache memory 200.
The output from this memory can then control the cross- 
bar in the same manner as described in conjunction with 
Figure 3. This approach eliminates the need for sense 
amplifiers between the instruction cache and the cross- 
bar. This enables the crossbar to switch very low voltage 
signals more quickly than higher level signals, and the 
need for hundreds of sense amplifiers is eliminated. To 
provide a higher level signal for control of the crossbar, 
sense amplifier 205 is placed between the pipeline tag 
cache 200 and the crossbar 100. Because the pipeline 
tag cache is a relatively small memory, however, it can 
operate more quickly than the instruction cache memo- 
ry, and the tags therefore are available in time to control 
the crossbar despite the sense amplifier between the 
cache 200 and the crossbar 100. Once the switching 
occurs in the crossbar, then the signals are amplified by 
sense amplifiers 210 before being supplied to the vari- 
ous pipelines for execution. 

[0040] The architecture described above provides 
many unique advantages to a system using this cross- 
bar. The crossbar described is extremely flexible, ena- 
bling instructions to be executed sequentially or in par- 
allel, depending entirely upon the "intelligence" of the 
compiler. Importantly, the associative crossbar relies 
upon the content of the message being decoded, not 
upon an external control circuit acting independently of 
the instructions being executed. In essence, the asso- 
ciative crossbar is self directed. 
[0041] Another important advantage of this system is 
that it allows for more intelligent compilers. Two instruc- 
tions which appear to a hardware decoder (such as in 
the prior art described above) to be dependent upon 
each other can be determined by the compiler not to be 
interdependent. For example, a hardware decoder 
would not permit two instructions R1 + R2 = R3 and R3 
+ R5 = R6 to be executed in parallel. A compiler, how- 
ever, can be "intelligent" enough to determine that the 
second R3 is a previous value of R3, not the one calcu- 
lated by R1 + R2, and therefore allow both instructions 
to issue at the same time. This allows the software to
be more flexible and faster. 
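
The register example above can be sketched as follows. This is a simplified model, not the described hardware: it assumes that instructions issued in the same cycle all read their source registers before any of them writes its result. The function name `issue_in_parallel` and the register values are made up for illustration.

```python
from typing import Dict, List, Tuple

# (src_a, src_b, dest) meaning: dest = src_a + src_b
AddInstr = Tuple[str, str, str]


def issue_in_parallel(regs: Dict[str, int], group: List[AddInstr]) -> Dict[str, int]:
    """Execute a group of adds as if they issued in the same cycle.

    Every source register is read from the old register state before
    any destination is written, so a later instruction in the group
    sees the previous value of a register written by an earlier one.
    """
    results = [(dest, regs[a] + regs[b]) for a, b, dest in group]
    new_regs = dict(regs)
    for dest, value in results:
        new_regs[dest] = value
    return new_regs


regs = {"R1": 1, "R2": 2, "R3": 10, "R5": 5, "R6": 0}
after = issue_in_parallel(regs, [("R1", "R2", "R3"), ("R3", "R5", "R6")])
# R6 is computed from the previous R3 (10), not from R1 + R2 (3).
print(after["R3"], after["R6"])  # prints: 3 15
```

Whether reading the previous value of R3 is the intended behavior is exactly the decision the paragraph assigns to the compiler. The sketch only shows what parallel issue yields once that decision has been made.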

[0042] Although the foregoing has been a description
of the preferred embodiment of the invention, it will be
apparent to those skilled in the art that numerous
modifications and variations may be made to the invention
without departing from the scope as described herein.
For example, arbitrary numbers of pipelines, arbitrary 
numbers of decoders, and different architectures may 
be employed, yet rely upon the system we have devel- 
oped. 



Claims 

1. A method for operating a processor, comprising the
steps of:

storing in a memory a plurality of instructions,
each instruction being one of a plurality of instruction
types, the instructions encoded in frames, each frame
including a plurality of instruction slots and template
bits that specify instruction group boundaries within
the frame, with an instruction group comprising a set of
statically contiguous instructions that are executed
concurrently;

using a crossbar switch means (100) coupled
to a plurality of execution units (0, ..., 7) to issue
instructions in the instruction group in parallel
to execution units (0, ..., 7) from the plurality of
execution units (0, ..., 7) in response to the template
bits;

wherein each execution unit (0, ..., 7) of the plurality
of execution units (0, ..., 7) is one of a plurality of
execution unit types; and

wherein each instruction type is executed on
one or more execution unit types.

2. The method of claim 1 wherein using the
crossbar switch means (100) further comprises using
the crossbar switch means (100) to couple the
instructions to appropriate execution unit types in
response to the template bits.

3. The method of claims 1 or 2 wherein the instruction 
types include integer instructions and floating-point 
instructions. 

4. The method of one of the claims 1 to 3
wherein the instruction types include load instructions
and store instructions.

5. The method of one of the claims 1 to 4 wherein the 
execution units include an arithmetic logic unit and 
a floating-point unit. 

6. The method of one of the claims 1 to 5 wherein the 
template bits comprise 4 bits.

7. The method according to one of the claims 1 to 6,
wherein a byte order of the instructions in the frame
in the memory is in a little-endian format or in a
big-endian format.

8. The method according to one of the claims 1 to 7,
wherein an instruction in the frame with the lowest
memory address precedes an instruction in the
frame with the highest memory address.

9. The method according to one of the claims 1 to 8,
wherein the frame comprises at least first, second,
and third instruction slots.

10. The method according to one of the claims 1 to 9,
wherein the template bits are at least partly determined
at compile time.

11. A processor comprising:

an instruction set including instructions which
address registers, each instruction being one
of a plurality of instruction types, the instructions
encoded in frames, each frame including
a plurality of instruction slots and template bits
that specify instruction group boundaries
within the frame, with an instruction group comprising
a set of statically contiguous instructions
that are executed concurrently;

a plurality of execution units (0, ..., 7), each
execution unit (0, ..., 7) being one of a plurality
of execution unit types, wherein each
instruction type is executed on one or more
execution unit types; and

a crossbar switch means (100) coupled to
the plurality of execution units (0, ..., 7), the
crossbar switch means (100) configured to
issue instructions in the instruction group
in parallel to execution units (0, ..., 7) from
the plurality of execution units (0, ..., 7) in
response to the template bits.

12. The processor of claim 11 wherein the crossbar
switch means (100) is also configured to couple the
instruction slots to the execution unit types in
response to the template bits.

13. The processor of claims 11 or 12 wherein the in- 
struction types include integer instructions and 
floating-point instructions. 

14. The processor of one of the claims 11 to 13 wherein
the instruction types include load instructions and 
store instructions. 

15. The processor of one of the claims 11 to 14 wherein
the execution units include an arithmetic logic unit
and a floating-point unit.

16. The processor of one of the claims 11 to 15 wherein
the template bits comprise 4 bits.

17. The processor according to one of the claims 11 to
16, further comprising a memory that stores the
frames, a byte order of the frames in the memory
being in a little-endian format or in a big-endian format.

18. The processor according to one of the claims 11 to
17, wherein an instruction in the frame with the lowest
memory address precedes an instruction in the
frame with the highest memory address.

19. The processor according to one of the claims 11 to
18, wherein the frame comprises at least first, second,
and third instruction slots.

20. The processor according to one of the claims 11 to 
19, wherein the template bits are at least partly de- 
termined at compile time. 

21. A cache memory comprising:

a frame of instructions; the frame including a
plurality of instructions and template bits that
specify instruction group boundaries within
the frame, each instruction being one of a plurality
of instruction types, and an instruction
group comprising a set of statically contiguous
instructions that are executed concurrently;

wherein each instruction type is to be executed
on an execution unit (0, ..., 7) from a plurality of
execution units (0, ..., 7), each execution unit (0, ..., 7)
being one of a plurality of execution unit types; and

wherein instructions in the instruction group
are issued by crossbar switch means (100) in parallel
to execution units (0, ..., 7) from the plurality of
execution units (0, ..., 7) in response to the template
bits.

22. The cache memory of claim 21 wherein the instructions
are also issued by the crossbar switch means
(100) to appropriate execution unit types in
response to the template bits.

23. The cache memory of claims 21 or 22 wherein the
instruction types include integer instructions and
floating-point instructions.

24. The cache memory of one of the claims 21 to 23
wherein the instruction types include load instructions
and store instructions.

25. The cache memory of one of the claims 21 to 24
wherein the execution units include an arithmetic
logic unit and a floating-point unit.

26. The cache memory of one of the claims 21 to 25
wherein the template bits comprise 4 bits.

27. The cache memory according to one of the claims
21 to 26, wherein a byte order of instructions in the
frame of instructions is stored in a little-endian format
or in a big-endian format.

28. The cache memory according to one of the claims
21 to 27, wherein an instruction in the frame with
the lowest memory address precedes an instruction
in the frame with the highest memory address.

29. The cache memory according to one of the claims
21 to 28, wherein the frame comprises at least first,
second, and third instruction slots.

30. The cache memory according to one of the claims
21 to 29, wherein the template bits are at least partly
determined at compile time.